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Abstract 

Properties of data are frequently seen to vary depending on the sampled 
situations, which usually changes along a time evolution or owing to envi- 
ronmental effects. One way to analyze such data is to find invariances, or 
representative features kept constant over changes. The aim of this paper 
is to identify one such feature, namely interactions or dependencies among 
variables that are common across multiple datasets collected under different 
conditions. To that end, we propose a common substructure learning (CSSL) 
framework based on a graphical Gaussian model. We further present a sim- 
ple learning algorithm based on the Dual Augmented Lagrangian and the 
Alternating Direction Method of Multipliers. We confirm the performance 
of CSSL over other existing techniques in finding unchanging dependency 
structures in multiple datasets through numerical simulations on synthetic 
data and through a real world application to anomaly detection in automobile 
sensors. 

Keywords: Graphical Gaussian Model, Common Substructure, Dual 
Augmented Lagrangian, Alternating Direction Method of Multipliers 



1. Introduction 



In several real world data, suc h as that from the stock market (jBaillie and Bollerslevi . 

19891 ) , gene regulatory net works flAhmed and Xind . l2009l : IZhang et al.l . l2009l ) , 
biomedic al measurements ( jVaroquaux et al.l . |2010| ) , or sensors in engineering 



systems (jlde et al.l . |2009| ). there are dynamical properties over time evolu- 



tions or due to changes in the surrounding environments. Such effects cause 
data to have different behaviors in each dataset collected under different con- 
ditions. One way to analyze such data is to explicitly include the change into 
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the model (JHamiltonl . 1 1994 iDurbin et all I2OOII ). which usually requires de- 
tailed domain knowledge that is rarely available in most cases. Another way 
is to impose general and mild assumptions on the data. This kind of approach 
is especially c ommon in the multi-task learning literatures (jCaruanal . Il997 ; 
Turlach et al.l . |2005), where the relationships among datasets are treated as 
a clue for combining multiple tasks into a single problem. The scope of the 
present paper is in the latter context where the relationship among datasets 
is the objective we want to analyze. For the purpose, we focus on invariance 
of the data against the underlying cha nges which provides pa r tial yet im 



portant aspects of the data behaviors (jvon Biinau et al.l . l2009l : iHara et al. 



2OI2I ). We provide a technique for finding one of such invariance, specifically 
constant interactions or dependencies among variables across several different 
conditions. An illustrative example is an engineering system where system er - 
rors are observed as dependency anomalies in sensor values (jlde et al.l . |2009| ) , 
which are usually caused by a fault in a subsystem. The invariance, which 
in this example is the remaining healthy subsystems, is captured by a steady 
dependency over the multiple datasets sampled before and after the error 
onset. Hence, we can use such information as a clue for finding erroneous 
subsystems. 

Graphical mod eling is a popular approach for analyzing dependencies in 
multivariate data ( JLauritzenl . Il996l ). We adopt one of the most fundamen- 
tal models, a graphical Gaussian model (GGM), as the basis of our frame- 
work. A GGM is a basic model representing linear dependencies among 
continuous random variables, and has been widely studied owing to the sim- 
ple nature, that is, the dependency structure is represented by the zero 
patterns in an inverse covariance m atrix. Identificat ion of such zero pat- 



terns from data was first studied by iDempsterl (Il972l ) as a Covariance Se- 
lection where the task is formulated as the combinatorial problem of opti- 
mizing the location of zeros in a matrix. Since classical algorithms for this 
do not scale to hi gh dimensional data, the scope of s tudies has shifted to 
a relaxed setti n g (JMeinshausen and Biihlmannl . 120061 : lYuan and Linl . 12007 : 
Banerjee et al.l . |2008| ). where Covariance Selection is formulated as a con- 
vex optimization problem using a £i-regularization that induces zeros in the 
resulting matrix. Because of the effectiveness of the relaxed formulation, sev- 



eral related optirn i zation techniques have a l so been studied (iFriedman et al 



2008 : lpuchi et al.l.l2008al:lLi and Tohl. 12010 ilScheinberg and Rishl .l2010l:lYuan 
2OO9I : IScheinberg et al.l . I2OIOI : iHsieh et al.l . boili ). 

In our context, the objective is not to estimate the structure of a GGM 



from a single dataset, but to decompose the resulting GGMs from several 
datasets into common and individual substructures, with the former repre- 
senting the invariance we aim to detect. The re are some prior studie s on 
learning a set of GGMs frorn multiple datasets. IVaroquaux et al.l (120101) and 
Honorio and Samaras! (120101 ) imported the idea of Group-Lasso (Yuan and Linl . 
2006|;|Bach|,l2Q08|) and Multitask-Lasso (IXurlach et aIl . l2005l : lLiu et al.l . l2009h . 
and extended the framework of a single GGM setting. In both cases, the prob- 
lem is formulat ed under the ass umption that all matrices share the same 
zero patterns. 



GuoetaL 



(120111 ) considered a method to avoid this addi- 
tional assumption, although the problem then loses convexity. Though these 
approaches achieved some success in improving the estimation accuracy of 
graphical models, this does not necessarily mean that they are suitable for 
finding commonness across datasets as we w ill see in the simulation . In 
the context of common sub structure detection. IZhang and Wand (120101 ) pro- 
posed using a Fused-Lasso (JTibshirani et al.l . l2005l ) type of technique to find 
an invariant pattern between two d ataset s. As a general framework for A^ 
datasets situations, IChiquet et al.l (120 ll() considered imp o sing s ign coher- 



ence on the result ing structures, w h ile iHara and Washiol ( 120111 ) extended 
the framework of IZhang and Wand ( 20101 ) to the general situation of A^ 
datasets |1|. In the opposite co ntext where the target is dynamics rather than 
invariance, IZhou et al.l (120101 ) proposed using weighted statistics to trace the 
evolution of a GGM. We note there ar e also several related studies in the 
binar y Markov random field literatures ( Guo et al. j_ 20071: lAhmed and Xingl . 
20091 ) . They also use £ i -regularization (IWainwright et al.l . 120071 ) and Fused- 
Lasso type techniques ( Ahmed and Xind . |2009|) for recovering temporal de- 
pendency structures, which are technically quite close to the ones of GGM. 
The contribution of this paper is two folds. First, we introduce the novel 
Common Substructure Learning (CSSL) framework that is applicable for a 
general case of A^ datasets. Second, a sophisticated a l gorith m based on the 
Dual Augmented Lagrangian (DAL) (JTomioka et al.l. 120111) and the Alter- 
nating Dire c tion M ethod of Multipliers (ADMM) ( iGabay and Mercierl . Il976 : 
Boyd et al.l . l201ll ) is proposed. In the proposed algorithm, the inner prob- 
lems for each iterative update are simple and can be solved efficiently which 



^This paper is an extension of iHara and Washiol (|2011[ ) with more general settings, an 
efficient optimization algorithm, and exhaustive simulations on synthetic and real world 
datasets. 



Table 1: Mathematical Notation 



Notation Description 



\x\ 



£p-norm of a vector x E M"', ||a;|L = ( X]j= 
for p e [1, oo) and ||a3||^ = maxi<j<d \xi\ 
vectorized £p-norni of a matrix A G M^^"^, 
\\A\\= \UA^^,Au,...,AMV\\ 



^i 



|A||g spectral norm of a matrix A G 



idxd 



AyO 
sgn (a) 

diag (x) 



||y4||g = maxi<j<dcrj(yl) where ai{A) is 
an ith singular value of A 
£i,p-norm of matrices B = {Bi; Bj G 



idxd^N 



\B\ 



hP 



J2j,j'=i IK^iji'' ^2,ji', . . .,B 



}i 



= l5 



N,jj') 



a matrix A is symmetric and positive definite 
sign function on a scalar a, sgn (a) = 1 for a > 1, 
sgn (a) = —1 for a < and sgn (a) = for a = 
d X d matrix with a; G M*^ on its diagonal 



results in fast computation. We confirm the validity of the CSSL approach 
through simulations on synthetic datasets and on an anomaly detection task 
in real-world data. 

The remainder of the paper is organized as follows. In Section^ we briefly 
review properties of GGMs and existing learning techniques. In Section [3l 
we present the proposed framework and its theoretical properties. The op- 
timization algorithm based on DAL-ADMM is introduced in Section |H The 
validity of the proposed method is presented through synthetic experiments 
in Section |5l In Section |6l we apply the proposed method to an anomaly 
detection task on sensor error data. Finally, we conclude the paper in Sec- 
tion [3 



2. Structure Learning of Graphical Gaussian Model 



In this section, w e review the GGM estimation problem (JMeinshausen and Biihlmannl . 



20061 : lYuan and Linl . 120071: iBaneriee et ahl. l2008f) and some prior extensions 



to multiple datasets (IVaroquaux et al. 



2010l : iHonorio and Samarad . 12010 : 



Zhang and Wangl . 120101 ). 

We also summarize mathematical notations used throughout the paper 
in Table [1] 



2.1. Graphical Gaussian Model 

In multivariate analysis, covariance and correlation are commonly used 
as indicators for a relationship between two random variables. However, in 
general, a covariance between two random variables Xj and Xj' is affected by 
other variables. Therefore, we need to remove such effects to estimate an es- 
sential dependency structure, which is available by searching for conditional 
dependency among random variables. In a general graphical model, we ex- 
press these dependencies using a graph with vertices corresponding to each 
random variable and edges spanning random variables that are conditionally 
dependent. 

Here, we assume that a d-dimensional random variable x = (xi, 0:2, . . . , x^)^ 
follows a zero mean Gaussian distribution, that is, x ~ A/'(Od, A~^) for some 
symmetric and strictly positive definite matrix A G M'^^^. We refer to a 
graphical model of Gaussian variables as graphical Gaussian model (GGM) 
Note that the zero mean assumption can be achieved without loss of general- 
ity by subtracting a sample mean from the dataset. Here, a covariance matrix 
is parameterized as the inverse of a precision matrix A since this is a more 
primitive parameter representing essential dependency among variables. A 
precision matrix relates to the conditional expectation as 

Ajjf oc —KlxjXj'lotheT variables] , 

that is, the (j, j')th entry of A is proportional to the covariance between Xj 
and Xj/ with the remaining d — 2 variables fixed. With this property, the 
conditional independence between Gaussian random variables is expressed 
as zero entries of A: 

Kjji = <^ ^i -LL Xj' I other variables 

where _LL denotes statistical independence. Because of this property, the edge 
patterns in a GGM correspond to the non-zero entries in a precision matrix 
A. In a GGM, two vertices have an edge between them if and only if the 
corresponding (j, j')th entry of A is non-zero. In the case that only few pairs 
of variables are dependent, most off-diagonal elements in A are zeros and the 
corresponding graph expression is sparse, which allows us to visually inspect 
the underlying relations. 



2.2. Sparse Estimation of GGM 

A naive way to estimate a precision matrix A is a maximum likelihood 
estimation formulated as 



A = argmax £(A; S) , 
Agp 

£(A;^) = logdetA-tr[5A] . 



(1) 



Here, £(A; S) is a log-likelihood of a Gaussian distribution (up to a constant), 
S* is a sample covariance matrix and P is a set of symmetric positive definite 
matrices V = {A G W^^"^] A >- 0}. The positive definiteness constraint is 
imposed so that the resulting A is a valid precision matrix. For a strictly 
positive definite matrix S, the solution to this problem is A = S^^. However, 
in a finite sample case, even when the true parameter is zero, that is, Ajj/ = 0, 
its maximum likelihood estimator Ajj/ is non-zero with probability one. In 
this situation, the resulting graphical model is a complete graph, which states 
that every pairs of variables is conditionally dependent and the underlying 
intrinsic relationships are masked. 

The major scope of GGM studies is how to avoid this unfavorable result 
from a maximum likelihood estimation and i nfer a sparse gra ph structure, 
which is referred to as Covariance Selection (JDempsteii . 119721 ). In classical 
studies, some entries of a precision matrix A are fixed as zeros and the remain- 
ing non-zero entries are estimated, where the zero pattern is optimized in a 
combinatorial manner. However, this combinatorial problem is not feasible 
for high-dimensional data. 

In recent studies, the use of an ^i-regularization has been shown to 
be practical for Covariance Sele c tion. The first such study was conducted 
by iMeinshausen and Biihlmannl ( 20061). Iri their approach, the solution is 
obtained by solving the Lasso (JTibshiranil . Il996[ ). Here, let us denote d- 
dimensional data with n data points using annxd matrix X = ^ xi X2 ... 
with Xj as its jth column and X\j as its remaining d 
column, we solve the following Lasso: 



JLr. 



1 columns. For each 



mm 



e 2 



^ X. 



^\j^\ 



+ p\\G\ 



1 ' 



(2) 



where p > is a regul arization parameter. We then set ze ro patterns of to 



the jth column of A. IMeinshausen and Biihlmannl (120061 ) have also showed 



the asymptotic convergence of their estimator to the true graph structure 



under a proper condition. This approac h was later reformulated as an £i 



regul arized maximum likelihood problem ( jYuan and Lid . 120071 : iBanerjee et al. 
2008h : 



max i(A: S) 
AeV 



p\\M\i ■ 



(3) 



We ref er to this problern as Sp arse Inverse Covariance Selection (SICS) fol- 
lowing IScheinberg et al.l (120101 ). The resulting precision matrix of ([3]) has 
some zero entries owing to the effect of an additional £i-regularization term. 
Several efficient optimization tec hniques are available for solvi ng this prob 
lem. E xampl es include GLasso (JFriedman et al.l. 120081) . PSM ( Duchi et al. 



2008al] IPM JLi and Tohl.l2010h SINCO fIScheinberg and Rishl Eoioh . ADMM 
f lYuanl . I2OO9I : IScheinberg et all . l2010f ) and QUIC flHsieh et al.l . EoTlH . 



2. 3. Learning a Set of GGMs with Same Topological Patterns 

The ordinary SICS problem ^ aims to learn one GGM from a single 
dataset. The extension of t his fr amew ork to multiple datas e ts ha s been 
studied by IVaroquaux et al.l (120 lOl ) and JHonorio and Samaras! ( I2OIOI ). The 
task is to estimate N precision matrices Ai, A2, . . . , Atv from N datasets 
where the sample covariance matrices for each dataset are 5*1, 5*2, ... , Sjsi- The 
objective of this multi-task extension is to improve the estimation accuracy of 
each GGM by incorporating the similarity among datasets. In the framework 
of the above studies, GGMs from each dataset are assumed to have the same 
topological patterns, that is, the same edge connection structures while the 
edge weights might be different for each GGM. They both introduced a Ci^p- 
norm of a set of N precision matrices {Ajj^j^ 



lAI 



i,p 




as a regular ization term analogou s to the Group-Lasso (JYuan and Linl. 12006 : 
Bachl . l2008h and Multitask-Las s o (JTurlach et all l2005l : JLiu et al.l . l2009h with 



p £ [1,00]. IVaroauaux et al.l ( 2010l ) has considered the case p = 2 while 



Honorio and SamarasI (20101) used p = 00. These two choices are commonly 
adopted in many scenarios owing to the computational efficiency. The entire 
estimation problem is defined as 



N 



max ^ ^ti£(Ai;S'i) - 



{A,;A,gP}™^i 



1,P 



(4) 



i=l 



with non-negative weights ti,t2, ■ ■ ■ ,t^. Without loss of generahty, we can 
hmit ourselves to the normalized case X]i=i ti = ^ since the unnormalized 
version is just a scaled objective function for some constant. The typical 
choice of parameters is t, = :=^ — where rii is the number of data points 

in the ith dataset. We refer to problem (jl]) as Multitask Sparse Inverse 
Covariance Selection (MSICS) in the remainder of the paper. 

Note that the MSICS problem (j3]) involves the ordinary SICS ([3]) as a 
special case when p = 1 where the £i i-regularization term completely de- 
couples into A^ individual £i-regularizations. In the extended case for p > 1, 

the regularization term enforces the joint structure Ajj/ = (X]i=i l^jjj'l^)'' 

to be sparse, with Ajj/ = indicating that the corresponding (j, j')th entries 
are zeros across all A^ precision matrices. 

2.4- Learning Structural Changes between Two GGMs 

Although taking advantage of situations with multiple datasets using the 
preceding techniques is useful for improving the estimation performances of 
the resulting GGMs, it only imposes joint zero patterns and does not indi- 
cate anything about the commonness of the non-zero entries. It is there- 
fore not that helpful when comparing GGMs representing similar models 
where we expect that there may exist so r ne com mon edges whose weights are 



close to each other. IZhang and Wangl (J2010l ) considered the two datasets 



case and constructed an a lgorithm using a Fused-Lasso type regulariza- 



tion ( jTibshirani et al.l . 120051 ) to round these similar values to be exactly the 



same allowing only significantly different edge s between two GGMs to be 



extracted. Their approach follows the ideas of iMeinshausen and Biihlmann 



( l2006l ) by connecting the update procedure ([2]) for two datasets Xi and X2 
through a new regularization term for the variation between two parameters 

11^1 — ^2|li, 



2 

^ij^E{^II^M--^^A.^^Il2 + HI^''lli}+7ll^i-^2lli, (5) 



where 7 > is a regularization parameter for the variation. The new term 
enforces the variation of some elements in two parameters to shrink to zeros. 
They also provided a coordinate descent-based optimization procedure for 
the above problem. 



3. Learning Common Patterns in Multiple GGMs 



The preceding work by IZhang and Wand (I2OIOI ) adopted the idea of the 



Fused-Lasso type technique using the specific formulation of the two datasets 
situation. In our study, we introduce a new framework, a Common Substruc- 
ture Learning (CSSL), for finding invariant patterns in multiple dependency 
structures that is applicable to the general case of A^ datasets. 

3.1. Common Substructure Learning Problem 

We first formalize what invariance we are aiming to detect in multiple de- 
pendency structures. To begin with, we assume that the number of variables 
in each dataset is the same, so they are all (i- dimensional. Also, the identities 
of each variable are the same. For instance, Xi is always a value from the 
same sensor while its behavior may change across datasets. We then define 
a common substructure for multiple GGMs as follows. 

Definition 1 (Common Substructure of Multiple GGMs). Let Ai, A2, 

. . . , An be precision matrices corresponding to each CCM. Then, the com- 
mon substructure of the CCMs is expressed by an adjacency matrix G M'^^'^ 
defined as 

Q.., = / ^i.ii' ' '^f ^i.ii' = ^2ji' = . . . = Amjj' /g-, 

■'■' \ , otherwise ' 

Note this is a natural extension of th e invariance notion adopted in the 
prior work by IZhang and Wang (2010) for the case of two datasets. With 



an ordinal sparsity assumption for GGMs, this definition leads the precision 
matrices to simultaneously have sparseness and commonness. That is: 

• Sparseness: Aj jj/ = for some 1 <i < N and 1 < j,j' < d, 

• Commonness: Aij^/ = A2JJ' = . . . = Anjj/ for some 1 < j,j' < d. 

Under the above commonness, the basic idea of our framework is to 
parametrize each precision matrix Aj using two components, a common sub- 
structure 9 and an individual substructure Vti G M'^^'^: 

A, = + fi, . (7) 

Here, each individual substructure matrix Vti is composed of non-zero entries 
that are not common across the A^ precision matrices. 



In the preceding formulation (jS]) , some entries in the two precision matri- 
ces are shrunk to the same value owing to the effect of the term \\6i — 02|li- 
In the proposed parameterization, such commonness corresponds to the case 
when some entries of the individual substructures are simultaneously zero, 



that is, ill 



3f 



^ 



2,jj' 



n 



N,jj' 



0. Hence, the no n- zero common value 



is expressed by a common substructure matrix O. These facts motivate us 
to regularize the individual substructures through the grouped regulariza- 
tion ll^llj^p. On the other hand, we expect a common substructure G to be 
sparse so that we can interpret it easily. To that end, we adopt an ordinary 
£i-regularization \\Q\\i and the overall problem is summarized as follows: 

N 



max ^tii{Q + Qi] Si) -p||0||i -7ll^lli,p 



e,{n.h^, 



j=i 



s.t. Q + QieV il<i<N) 



(8) 



with regularization parameters p, 7 > 0. Since —£(6 + Qi;Si), ||6||i and 
||r2||ip are all convex, the entire formulation is again a convex optimiza- 
tion problem. We refer to this problem as Common Substructure Learning 
(CSSL). Note that in the above formulation, we have slightly relaxed the 
condition of commonness to allow Qjji and ^ijj' to become simultaneously 
non-zeros which is contrary to Definition (Q. We correct this point by ap- 
plying the criterion (|6]) to the resulting precision matrices Ai, A2, . . . , A^v in 
the post processing stage to extract only truly common entries. 

Here, we hst two important properties of the CSSL problem (|H]), a dual 
problem and the bound on eigenvalues. We first present the dual problem, 
which plays an important role in constructing an efficient optimization algo- 
rithm in the next section. 

Propostion 1 (Dual of CSSL). The dual problem of CSSL ^ is 

N 

min — > ti log det Wi — d , 

{Wr,W,GV}t, j^ 

N 



S.t. 



N 



Y^t1\w,,r - S.. 



i,jj' 



j=i 




(9) 



10 



where q is a parameter satisfying p^^ + q^^ = 1. The resulting matrices of 
the dual problem W* are related to the optimal precision matrices A* through 
the inverse, A* = W*~^ . 

In both the primal and dual formulations ([8]), ([9]), we enforced the positive 
definiteness constraints, Aj = + f2j G P and Wi & V so that the matrices 
are valid precision or covariance matrices. Here, we show that they can be 
tightened according to the next theorem. 

Theorem 1 (Bounds on Eigenvalues). The optimal precision matrices for 
the CSSL (OJ A^, Ag, • • • , A^ with < p < Np'J < cx) have bounded eigen- 
values Xf^™Id ^ A* ^ X^^^Id, where the bounding parameters A™° and A™*^ 
are 

f- Npd'^ 



I 



t,\\Si\L + d^' ' p 



Using this result, we can replace the constraint Aj G V with the tighter 
Ai G P, = {A G W^'^'^-.A >z Xf^'Id}, and similarly Wi e {A e M'^^'^'; A >z 
^max-ij^j^ Note that this update is practically important when constructing 
an optimization algorithm. Since the new constraint set Vi is closed, we 
can project points out of the constraint set onto the boundary, which is 
unavailable for the original open set V. 

3.2. Interpretations of CSSL 

The proposed CSSL problem ([8]) can be interpreted as a generalization of 
an ordinary SICS problem ([3]) and its multi-task extension MSICS (jl]) . In the 
case that 7 — )■ oo, the solution to the CSSL is i7i = f22 = . . . = VIn = O^xd, 
which means that all precision matrices are equal and are represented by a 
single matrix 6. Such B is available by solving the SICS problem ([3]) with 
5" = ^j=i tiSi. On the other hand, if p > Np'J, the common substructure 9 
becomes zero. This fact follows from the relationship between the ip-norms: 

7 lie + aiii^p < N-p^ \\e\\, + 7 linjii^p < p jieiii + 7 ||fi|li,p . 

Suppose that the common substructure is non-zero, that is, 6 7^ Odxd, then 
the above inequality means that the update fij ^ 9 + Qi and 9 ^— O^xd 
improves the objective function value ([8]) without changing the resulting 
precision matrix Aj = Q + Qi, and thus the solution must be 9 = Odxd- Under 

11 



this situation, the CSSL problem (^ coincides with MSICS (jl]). For the 
proper parameters p < Np'y < oo, the CSSL problem ([8]) is the intermediate 
of those two problems. 

The CSSL problem can also be interpreted from a distributional per- 
spective. From the relationship between the Lagrangian expression and the 
constrained optimization problem, the CSSL problem ([H]) is equivalent to 
solving a set of A^ maximum likelihood estimation problems ([1]) under the 
additional constraints 

\mi<v, iifiiii,,<^', (10) 

for some properly chosen positive constants rj, rj'. Moreover, we have 



max ||r2j — r2j'||-| < max > (|r2j i,v| + |r2j/ 
<i<i'<N ^ i<i<j'<Af ^^^ '■^■' 



l<i<i'<N l<i< 



< 2||fi|L < 2||fi|L , 

— II lll,oo — II lll,p ' 



where the second inequality comes from the fact that exchanging the order 
of maxi<j<j'<jv and ^ . v^^ produces the upper bound. The last inequality 
is an ordinary relationship between £p-norms. These relations and the fact 
that Aj — Aj/ = Qi — Qi' lead to the bound 



max IIAj— Aj/|L < 2?]' . 

l<i<i'<N 



Hence, from the result of lHonorid (120 111 . Lemma 23) and general matrix norm 



rules, the left-hand side of this inequality can be interpreted as the upper 
bound of the KL divergence between two distributions Pi{x) = A/'(Od, A~^) 
and Pi'{x) = J\f{Od, A.^^). With these properties, we can interpret the second 
constraint in fITOl) as a constraint on the similarity among distributions: 

max DKL{Pi{x)\\pi'{x)) < 2r]' max ||Ari||s, 

l<i,i'<N l<i<N ' 

where D}<iL{Pi{x)\\pi'{x)) denotes a KL divergence between two distributions 
Pi{x) and Pi'{x). From Theorem (H the optimal parameters A*, A2, . . . ,A*pf 
have bounded spectral norms for a finite 7, and thus this upper bound on the 
KL divergence is always valid. Moreover, we can further extend this bound 
into the extreme case 7 — ?■ 00 and r/' — )■ 0. As we have discussed before, 
this is the case Qi = Q2 = ■ ■ ■ = ^n = ^dxd and the problem is equivalent 

12 



N 



to solvinK a si n gle S ICS problem for 9 with S = X]i=i ^i- Hence, from 



Banerjee et al.l (120081 . Theorem 1), we can see that the resulting precision 
matrices still have finite eigenvalues for p > 0, and the right hand side of the 
above inequality goes to zero. This means that the resulting distributions 
represented by precision matrices derived from CSSL ([8]) have to be similar 
to one another at some level and they can be even identical in the extreme 
case. Note that MSICS (jl]) is a special case of CSSL when 9 = Odxd and 
thus the same upper bound holds, although there is the significant distinction 
that the parameter r]' in MSICS (jlj) also affects the sparsity of the resulting 
precision matrices while CSSL ([8]) can control the sparsity through the other 
hyper-parameter p. 

3.3. Connection to Additive Sparsity Models 

In this section, we discu ss some connections of the CSSL proble m 



to Additive Sparsity Models (iJalali et al.l. 120101: IChandrasekaran et al 




Agarwal et al.l . l201ll : ICandes et al.l . 1201 ll : lObozinski et al.l . l201ll ) . In general 
additive sparsity models, the objective parameter we want to estimate is 
modeled as the sum of two components, as in ([7]). Hence, these two param- 
eters are estimated using sparsity inducing norms such as an £i-norm and a 
trace-norm. In this sense, CSSL can be seen as a specific example of additive 
sparsity models where we use the combination of an £i-regularization and a 
group-wise regularization. 



Here, we point out two close works from lJalali et al.l ( l2010l ) and lChandrasekaran et al 



(120 lOl ). The former considers the multi-task least squares regression problem 
under the combination of ii, group-wise regularizations. Their basic idea is 
quite close to ours in that some regression parameters can be close to each 
other across datasets. They also prove the advantage of combining two reg- 
ularizations over using only one theoretically and numerically. The latter 
study is on GGMs but with different sparsity assumptions from ours. They 
show that the additive sparsity model naturally appears in GGM when there 
are latent variables. In such a situation, the first component in the additive 
sparsity model corresponds to the precision matrix between observed vari- 
ables while the latter component is an interaction between latent variables. 
This insight is also available for interpreting our model ([7]), that is, a com- 
mon interaction among observed variables is contaminated by the effect of 
latent variables which are different for each dataset. 
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4. Optimization via DAL-ADMM 

In this section, we present the optimization algorithm for solving the 
CSSL problem ([9]). Ou r basic approac h here i s to ad opt the Augmented 
Lagrangian technique s (JHestened . Il969l : IPowelll . 119671 ). In a prior study, 
Tomioka et al.l (120111 ) have shown that solving a dual problem using the 
Augmented Lagrangian, which is referred to as Dual Augmented Lagrangian 
(D AL), is preferab l e for t he case when the primal loss is badly conditioned. 
See iTomioka et al.l (J201ll Table 3) and the discussion therein. This is actu- 
ally the case we are faced with, as summarized in the next theorem. 

Theorem 2. The Hessian matrix of the CSSL primal loss function Yli=i tii{Q+ 
Qf, Si) is rank- deficient while the Hessian matrix of the CSSL dual loss func- 
tion — X]j=i ti logdet Wi is always full rank for < p < Npj < oo. 

This fact motivates us to solve the dual problem rather than the primal prob- 
lem. To that end, we construct an algorithm based on the DAL approach. 

4.1. DAL-ADMM Algorithm 

The basic structure of the proposed algorithm is based on the idea of 
DAL. However, while the orig i nal D AL requires solving the inner problem 



almost e xactly (ITomioka et al.l. l201 



), we take an alte rnative approach using 



Boyd et al.l . I2OIII ) that makes the entire 



ADMM (JGabav and Mercieil . Il976l : 
procedure dramatically simple. 

To begin with, we rewrite the CSSL dual problem (Q in the following 
equivalent form: 

TV 

min — N ti log det Wi 

{w„Y^■,w,eV}fL^ ^ 



s.t. tiWi -Y,-tiSi = Q {l<i< N) 



N 



Y^y- 



ijj' 



j=l 



N 



<p> Ei^K 



,jj' I 



<7 {l<3,j'<d). 



(11) 



i=l 



Based on this expression, we define the following Augmented Lagrangian 
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function: 



Cfi{W, Y,Z) = -J2U log det W, + 5p{Y) + ^{Y) 



i=l 

+ tr [Z^ (TW -Y - TS)] + - \\TW -Y - TE\\l , (12) 

where /3 is a nonnegative parameter and T,,W,Y and Z are the concatenated 
matrices S = [ ^i ^2 ••• Sn]^, W = [Wi W2 ... Wn j'^ , Y = 
[ Yi Y2 ... Fat ] and Z = [ Zi Z^ ... Zfq ] , and T is as the matrix 
T = diag ([ti, ^2, . . . , ^Af]""") ® Id, where ® denotes the Kronecker product and 
Id is the c?-dimensional identity matrix. We also defined the functions Sp{Y) 
and S^(Y) as 



my) 



0, if 

00 , otherwise 



Er=i Yhji' < p for 1 < j, j' < d 



, if (Eti l>^..yr) ' < 7 for 1 < j,j' < d 

00 , otherwise 



In the Augmented Lagrangian function ( TT2i) . the optimal precision matrix A* 
is represented by the optimal dual variable Z*. This can be verified through 
a simple calculation. We set the derivative of the unaugmented Lagrangian 
£0(1^5 Y, Z) over Wi to zeros and find that 



wr' = z: , 

which imphes that A* = Z* from Proposition [H This follows since the 
solution to ( ITT]) must be the saddle point of the unaugmented Lagrangian 

function £o(W^,>',^)- 

We solve problem flTTj) using ADMM by iteratively applying the following 
three steps until convergence: 

y^/ik+i) ^ argmin Ci3{W,Y^''\ Z^''^) 

Y(k+i) (z argmin Cp{W^''+^\ Y, Z'-''^) 

Y 

^(fc+i) ^ 2{k) _^ p (TV^{fc+i) _ y(fc+i) _ TE) 
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Hence, using ADMM, convergence of the dual variable Z to the opt imal pa- 



rame ter Z* is guaranteed as the number of iterations tends to infinity ( jBoyd et al. 



201 ll . Section 3.2). This means we can find the optimal precision matrices 
A^, Ag, . . . , A^ using DAL- ADMM. In the following two subsections, we give 
the update procedures for W and Y . 

4-2. Inner Optimization Problem: Update of W 

The update of W can be factorized into A^ independent problems where 
each problem defines an update of Wf. 

2 
ti^i 



min —ti log det Wi + t^tr 



zf^ W, 



2 



uw, - y. 



(fc) 



By setting the derivative over Wi to zero, we obtain 



1, 



.(fc) 



W, - -Y.r' - -77 Zr + 5* - -77rWi 



(k) 



/3t. 



1 



^-1 



f3ti 



0, 



dxd ■ 



Now, write the eigen-decomposition as j-F/ — ^Z^ + Si = PDP^ with 
D = diag (cTi, (72, ... , era) and P^ P = PP^ = I^. Then, the above matrix 
equation has a solution of the form Wi = PDP^ with D = diag (cti, 5"2, . . . , a^) 
The equation for each eigenvalue is o'm — crm — jf^^ = (1 < m < (i), which 
has the analytic solution 



CTr, 



(^m + ^/<^m + ^ 



Note the positive definiteness of Wi is automatically fulfilled since 0"^ > 
for /3 > 0. 

4-3. Inner Optimization Problem: Update ofY 
The update of Y is formulated as 

/3 



min 6piY) + 6'JY) - tr 



Y 



Z{k)^Y 



2 II Il2 



or equivalents, the projection Y = proj (Yo,A) of Fq = TW^'^+^^ + ^Z^''^ -TE 

onto the set ^ = |r = [Fi Y2 ... l^iv ]^ ; |Eti ^i^'l < P . (EL l^^y 

where proj(*, *) is a projection function defined as 

1 2 

proj (V, i3) = argmin- \\U — V\\2 ■ 



19^ ' 



<7 ,Vj,j'L 
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Table 2: Solutions to problem (J13p for q = 1,2 and oo: see the corresponding appendix 
for further details. An operator T~^{*) in y e dC2 for g = oo is a thresholding for each yo.ii 
that is, yi = sgn(2/o,j) min(|yo,j|,7)- 



1 



oo 



Vo^C 



y = yo 



yedCi 



y = yo 



Ij^yo-psgnjlly,)^^ ^ ^pp^^^.^ ^^ ^ 



yedC2 



Continuous Quadratic 
Knapsack Problem 



(Appendix A.2) 



y 



7 



I2/0II2 



yo 



(Appendix A. 3) 



y = Tyiyo) 



(Appendix A.4) 



y^dC3 



Continuous Quadratic 
Knapsack Problem 



Analytic Solution 



(Appendix A. 5) 



(Appendix A. 6) 



Continuous Quadratic 
Knapsack Problem 



(Appendix A. 7) 



We can further decompose this problem into 0{d'^) problems over y 



(Yi,n',Y, 



2,jj'^ 



Yj 



N,jj') 



for each (j, j')th entry. Hence, each problem is 



y = proj (Vq, C) 



(13) 



where y^ is an A^-dimensional vector with the ith component equal to yo^. 



TON. 



1 r-^ik) , q 



and where the constraint set is C 



{u e 



IItv'^I — P' ll'^IL — 7} "with Itv being an iV- dimensional vector of ones. 
For any q G [1, 00], problem f lT3|) has a trivial solution y = y^ ii y^ E C. 
In the remaining cases, that is, |l^i/ol > P or Wy^W > 7, the solution is 
on the boundary of the constraint set dC = {u; \ljfu\ = p, \\u\\ < 7} fl 



{u; 



i^v""! 



< p, \\u\\ = 7} owing to the convexity of the objective function. 



Thus, the problem can be reduced to a search of the boundary. However, 
even though the constraint set C is convex, it is an intersection of two sets 
and the shape of the boundary dC is rather complicated. Therefore, we do 
not search the boundary dC directly, but solve a set of simpler problems 
instead. The basic approach is to classify the boundary into three parts. 



dCs 



{u;\l 



TV 



U\ 



P> M\n ^ 7}, dC2 = {u; \lJfU\ ^ p, \\u\\ = 7} and 



-2 — 1"-, {---N^i I ri ii-iig 

Ljv"-! — p, \\u\\ = 7}. The problems we solve here are modified 
versions of ( 1T3|) . replacing the constraint with y G dCm for each m G {1, 2, 3}: 



rfi 



y = proj (2/0, dCra) . 



(14) 
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Note that dCi and dC2 involve infeasible solutions to the problem f ll3p . For 
example, a point y with ||y|| > 7 is infeasible even if y G 9Ci, while these 
three regions covers the entire boundary of the constraint set dC C VJ^^idCm- 
This guarantees that we can search the entire boundary dC indirectly by 
searching the sets dCm {'m = 1, 2, 3) instead. Hence, if neither of the solutions 
to (111]) for y G dCi and y G dC2 are involved in C, the solution to ( 1T3|) is 
in dC^. We can take advantage of this property to construct an efficient 
solution procedure. We first solve problems f lT4|) for y G dCi and y G dC2, 
respectively, and if neither of solutions is in C, then we solve f ll4p for y G dC^. 
In this paper, we focus on the specific cases q = 1,2 and 00, since efficient 
solution procedures are available. In Table [2l we summarized the solutions 



to problem f lT3|) . For further details, see Appendix A 



4 ■4- Convergence Criteria 

Although the asymptotic convergence of Z^''^ as A; — )■ 00 is theoretically 
guaranteed, in practice we need to stop the iteration at some point. A major 
stopping criterion is the duality-gap, the difference between the primal and 
dual objective function values. Let f(W) be the objective function in ([9]) 
and let g{Q, fi) be the one in (|S]). Then the duality-gap at the kth iteration 
is defined as 

duality-gap = /(H^^^'^) - max g{Q'^'''\Q^'''^) , 

where W^''\ Q^^^ and fi*^^^ denote parameters estimated in the kth step af- 
ter proper projections and transformations. We need these modifications of 
variables since the estimators in intermediate steps are not necessarily feasi- 
ble. For example, W^''^ does not need to satisfy the constraints in (|9]) since 
they are imposed only on a variable Y in the DAL-ADMM setting fITT]) . The 

projected variable W^''^ is W^''^ = T'^Y^^^ + S where Y^^~^ = proj (Yq^''\a] 
and Y^''^ = T{W^^^ - S). The same goes for A('=) = Z^''\ An estima- 

(k) 

tor A) is not necessarily positive definite, and thus we project them as 
Al = proj (Ai ^Vi). This projection is available in the following man- 

(k) T 

ner. Let A^^ = PDF be an eigen-decomposition with a diagonal matrix 
D = diag(cri, o"2, • • • , cTd)- Then the projected matrix is A^^ = PDP^ , where 
each element oi D = diag(ai, 5"2, . . . , a^) is cr^ = max(crm, A™"). For com- 
puting the value of g{Q^''\il^''^), we need to further factorize A*^'^^ into G)^'^^ 



and fl^''^ This can be computed in an element- wise manner. Let 9 = Q^J, 



oik) _ 7{k) 

to solve is 



(fe) 



(k) 



33 



9 and A = (A^ jj'i-^2 jj'i-^n jj') ■ Then the problem we need 



min p\9\ + 7 ||A — 91 



N\ 



For p = 1 and cxo, this function is piecewise linear with breakpoints {0, Ai, A2, . . . , \n} 
and {0, """' ' ^^^'' '' }, respectively. Hence, the optimal 9 is one of these 
breakpoints and can be found by searching the candidates. For the case 
p = 2, the analytic solution is 



»=^o;^ 



sgn 



i;a 



\ 



[lyxy - N- 



72(l^A)2-p2 



72 A^ - p2 



Some other useful gaps are provided by iBoyd et al.l (1201 if ) . The primal- 
gap measures how much the equality constraints in (ITTl) is fulfilled, 

primal-gap = \\TW^''^ - Y^''^ - ^^H^ , 

while the dual-gap is a degree of the feasibility condition of the solution, 
defined as 

dual-gap = f3 ||T(F('^+^) - Y^^'>)\\^ . 

In our simulations in Sections E] and El we have evaluated both criteria. 
We set two threshold parameters egap and Cpdgap, and evaluated the conditions 
duality-gap < egap and max (primal-gap, dual-gap) < Cpdgap in each iteration. 
If one of two conditions is fulfilled, we regard the iteration as converged and 
output the result. In the simulations in Sections |5] and [6l we set egan = I0~^d 
and epdgap = 10~^ 



cgap 



4-5. Computational Com,plexity 

In this section, we summarize the computational complexity of the pro- 
posed algorithm. In the W update step, the computational cost is dominated 
by the eigen-decomposition of a d x d matrix, which requires 0{d^) opera- 
tions, so the overall complexity is 0{Nd^) for the update of A^ matrices. In 
the Y update step, we need a projection proj {Yq,A) which is divided into 
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0{(P) subprobleins. For both q = 1 and q = oo, the most computationally 
expensive procedure is solving the continuous quadratic knapsack problem 
which requires sorting 0{N) elements and has complexity 0{N\io.N) q In 
the case q = 2, the update is analytically available with 0{N) complexity. 
The overall complexity for the Y update is thus 0{{N In N)d'^) for q = l,oo 
and 0{Nd?) for q = 2. The complexity for the Z update is 0{N(P). In the 
convergence check, we need to calculate the projection proj(Aj- K,Vi) which 
has 0{d^) complexity or 0{Nd^) for A^ matrices. We also need the projec- 
tion proj (y^''\a\ which is again C((A^ln A^)^^) for g = 1, oo and 0{Nd'^) 
for g = 2. Summarizing the above results, we conclude that the computa- 
tional complexity of one update in DAL-ADMM is 0{Nd^ + {N\YiN)d^) for 
g = 1, oo and 0{Nd^) for q = 2. In many practical situations, the number 
of datasets A^ is in the tens, while the dimensionality of the data d can be 
a few hundred. In such cases, InN '^ d holds, and the entire complexity is 
approximately 0{Nd^). We note this is the least necessary complexity. For 
an unregularized setting, the solution A* is a maximum likelihood estimate 
5*"^, which requires 0{d^) complexity for a matrix inverse and 0{Nd'^) for 
A^ matrices. 

Despite the theoretical complexity, the choice of /3 is of practical impor- 
tance since it affects the number of iterations needed until convergence. We 



propose using the heuristic from iBoyd et al.l (120111 ). In this heuristic, we 



update the value of /3 = (3^''^ in every steps following the next rule: 

2/3'^'^^ , if primal-gap > 10 * dual-gap 
^(fc+i) ^ ^ Q_^p(k) ^ if dual-gap > 10 * primal-gap . 
/3'^'^) , otherwise 

While this does not give any theoretical guarantees on its performance, it 
does give us a pragmatic choice of /3 and results in convergence with a smaller 
number of steps. 

4-6. Heuristic Choice of Hyper-parameters 

In the CSSL problem ([8]), the choice of hyper-parameters p and 7 affects 
the resulting precision matrice s. There are several approaches for ch oosing 



these, such as cross-validation (JYuan and Linl . 120071 : iGuo et al.l . I2OIII ) or the 



^See [Appendix AT2| [Appendix A.S] and [Appendix A.T] 
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Bayesian information criterion (JGuo et all 1201 ll ). Apart from selection tech- 
niques, the following result gives us some insight into p and 7, and is helpful 
for analyzing the data more intensively. 

Propostion 2. Let the bivariate common substructure B and individual sub- 



structures Qi be in the forms O 



9 

e 



and Qi 



Ui 

UJi 



Vi 



and con- 



sider the following CSSL problem with regularizations only on off-diagonal 
entries: 



N 



max ^tii{e + Qi] Si) - 2p\9\ - 2-f \\uj\ 



e,{n,}f 



1 i=i 



s.t. Q + QiEV {l<i<N) 



(15) 



where u) 



lWi,W2, 



,C^7V 



Then the off-diagonal entries of the resulting 



precision matrices 9, lj have the following property: 

N 



max n < 7 and 

l<i<N 



i=l 



< 



P 



e = o, uj = o 



N : 



where Tj is the off-diagonal entry of Si. 

Although the result is specific to the bivariate case, we can use this as 
a guideline for choosing the hyper-parameters p and 7. It also shows that 
p and 7 are not independent of each other, but rather they should change 

In particular, if 



^N 



simultaneously proportional to maxi<j<jv |rj| and XliLi ^i'^i 
each matrix Si is multiplied by some positive constant c, the above condition 
indicates that p and 7 also need to be multiplied by c. Such scale invariance is 
maintained only by a linear model between p and 7. Therefore, we construct 
the following heuristic based on this linear model. 



1. Assume that the linear relation 



■■,J3 



Si maxi<j<7v \Sijji\-\-So 



holds for all entries 1 < j < j' < d for some So,Si G M. 

2. Estimate sq, si with least squares regression using the tuples < maxi<j<Ar \Sijj'\, 

3. Parameterize p, 7 as p = max(sia + Sq, 0) and 7 = a using a parameter 
a. 

This procedure provides an efficient way of tuning p and 7 simultaneously 
through a single parameter a. 



l^i=l '^i^i,ij' J 
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5. Simulation 

In this section, we investigate the performance of the proposed CSSL ap- 
proach in finding common substructures among datasets through numerical 
simulations. 

5.1. Generation of Synthetic Data 

We fist briefly summarize the data generation procedure for our simula- 
tions. For the synthetic data, we need N precision matrices with sparseness 
and commonness. We tackle this problem in a two-stage approach. We 
first generate a single sparse precision matrix, and then add some non-zero 
entries to make A^ matrices where the additional patterns are individual 
to each other Q After A^ precision matrices Ai, A2, . . . , A^r have been con- 
structed, we generate A^ datasets from the corresponding Gaussian distribu- 
tions M{Od,A-^) ioil<i<N. 

5.2. Baseline Methods and Evaluation Measurements 

In the simulation, we adopt SICS ^ and MSICS (jl]) as baseline methods 
to compare with CSSL. Since neither method is designed for finding a com- 
mon substructure, we apply a heuristic to extract the substructure from 
the estimated precision matrices Ai, A2, . . . , Aat. Note that, in SICS, each Aj 
is estimated by solving (|3]) individually while the set of matrices is estimated 
simultaneously in MSICS dl]). Following is the heuristic criterion used: 

A ^ _ ( Ojji , if maxi<j<j/<rf |Aj jj/ — Aj/jj/| < e 
"'■' 1 , otherwise 

where e is some given threshold. Here, to avoid selecting zero edges as parts of 
a common substructure, we set Ojji to zero if Aijj' = K2JJ' = . . . = Atvjj' = 
and one otherwise. In our simulation, we select the threshold e from the 
resulting precision matrices. Specifically, we compute variations of estimators 

for each entry < maxi<j<j/<jv |Aj ,•,/ — Aj/ ,-,/| \ , and then set e as the 

I ~ ~ J i<i<i'<d 

100eo% quantile. This corresponds to considering the lower 100eo% varied 

entries as common. 

In our simulation, we evaluate the common substructure detection perfor- 
mance through precision, recall and the F-measure. While these values are 



^See [Appendix 5] for further details. 
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defined based on the number of true positive, false positive and false negative 
detections, we slightly modify these measurements. This is because finding 
common dependencies with higher amplitudes is much more important than 
finding very small dependencies which can be approximated as zero in prac- 
tice. To that end, we adopt following weighted measurements, namely WTP 
(weighted true positive), WFP (weighted false positive), and WFN (weighted 
false negative). 



WTP = 22 Jc,jj'Jp,jj'Jc,jj' ^max^ IKjj'l > 

d 

WFP = ^ Jc,jj'Jp,jj'{l - Jc,jf) max \Aijj,\ , 

■'— ' l<i<N 

d 

WFN = Y^ { Jcj/(1 - Jp,ry) + (1 - Jc,jr)] Jc,ry max |A, 



,n \ 1 



3<3' 

where Jcjj'j <^p,ji' and Jcjj' are defined as 
Jcjj' = / I max |Ajjj/ — Aj/jj/| < e 
Jd ii' = I I niax lAj jji\ > 
Jcjj' = I I max |Ajjj/ — Aj/jj/| = 

\l<i<i'<N 

Here, I{P) is an indicator function that returns 1 for a true statement P and 
otherwise. The modified measurements in the simulation are defined using 
these values as 

„ . . WTP 

Precision 



Recall 



F-measure = 2 



WTP + WFP ' 
WTP 

WTP + WFN ' 
Precision * Recall 



Precision + Recall 



In the simulation, we also observe whether the zero pattern in the preci- 
sion matrices is properly recovered using each method. We use the following 
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F-measure for this evaluation, which we refer to the " Fq- measure" to distin- 
guish it from the one above: 

2TP 

Fn-measure 



2TP + FP + FN ' 

TV d 

TP = J] J] /(A,,y = 0)/(A,,y = 0) , 

N d 
N d 

5.3. Result 

We conducted simulations for three cases with data dimensionality d = 
25, 50 and 100 where the number of datasets is fixed at A^ = 5. For each case, 
we generate precision matrices Ai, A2, . . . , A^r to have 15% non-zero entries 
on average. In the simulation, we randomly generate datasets 100 times and 
applied each method using several different hyper-parameters, where in each 
run we set the number of data points in each dataset to be 5d. For CSSL, 
we use the heuristic with a parameter a varying from 10~^ to 10~° over 41 
values. We also evaluate results for p = a and 7 = 00 to see the effect 
of 7 in an extreme case. As discussed in Section 13. 2[ this corresponds to 
solving a single SICS problem with S = J2i=i '^i^i ^^^ setting the result to 
Ai = A2 = . . . = An = A. For SICS and MSICS, we set the value of p as 
p = a. For each method, we adopt the resulting precision matrices with 15% 
non-zero entries among these 41 values of a. In SICS and MSICS, we also 
vary the thresholding parameter eo between 0.5, 0.7 and 0.9. 

We summarize the results in Table [31 From the table, we can see the 
clear advantage of CSSL for p = 2 and 00 over the other methods. These two 
methods show higher F-measures, which are from their higher precision and 
recall. This contrasts with other methods, SICS and MSICS, which achieve 
high recall, but have relatively poor precision. This means that structure 
detected by those methods involve not only true common substructure but 
also many false detections. This shows the drawback of estimated precision 
matrices derived through SICS and MSICS, that is, their estimators tend to 
be highly varied even for true common entries while this is not the case for 
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CSSL. This phenomenon is especially significant in SICS, which can hardly 
find common substructures owing to its highly varied estimators. The results 
for MSICS under p = oo and eo = 0.9 are still better than the others, al- 
though eo = 0.9 means that 90% of estimated non-zero entries are considered 
common, which is too optimistic. Moreover, we can see that the improve- 
ment of the F-measure is achieved by the growth of recall by contrasting the 
results with eo = 0.5 and 0.9. This means that variations on the true common 
substructure mostly happens in between 50% and 90% of the entire varia- 
tions of the estimated precision matrices, which are highly varied and can 
hardly be considered common. Note that despite the significant difference in 
the common entry detection performance, all methods achieve comparable 
zero pattern identification performance as shown by the Fq- measure. This 
shows that finding common entries is a different problem from the ordinal 
graphical model selection, and that only CSSL does well at both tasks. 

We note that CSSL with p = 1 and 7 = oo give two extreme results. 
In the former setting, the resulting precision matrices achieve higher preci- 
sion with lower recall, which is very conservative, while it is the opposite in 
the latter setting. The first result is caused by the difference of a grouped 
regularization ||f^||ip for p = 1 and p > 1. For p = 1, \\^\\ip completely 
decouples into ordinary £i-regularizations and the resulting precision ma- 
trices do not necessarily have common zero entries in individual substruc- 
tures. Intuitively speaking, the results for p = 1 have common zero entries 
^ijj' = ^2,jj' = . . . = ^N,jj' = only when it is strongly confident, which 
results in a very conservative performance compared with p > 1. On the 
other hand, if 7 = 00, the entire structures are considered to be common, 
which results in fewer false negatives and more false positives. 

6. Application to Anomaly Detection 

In this section, we apply CSSL to an anomaly detection problem. The 
task is to identify contributions of ea ch variable to t he difference between 



two datasets. Correlation anomalies (llde et al.l . l2009l ). or errors on depen 



dencies between variables, are known to be difficult to detect using exist- 
ing approaches, especially with noisy data. To overcome this pr oblem, the 



use of sparse precision matrices was proposed by llde et al.l (120091 ) . since the 
sparse approach reasonably suppresses the pseudo-correlation among vari- 
ables caused by noise and improves the detection rate. Here, we propose 
using CSSL. There is a clear indication that the proposed method can fur- 
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ther suppress the variation in the estimated matrices. In particular, we 
expect that dependency structures among healthy variables are estimated to 
be common, which reduces the risk that such variables are mis-detected and 
only anomalies are enhanced. 



6.1. Anomaly Score 



We adopt the measurement for correlation anomalies proposed by llde et al 



( I2OO9I ). This score is based on the KL-divergence between two conditional 
distributions. Formally, let x^, x^ G M.'^ be Gaussian random variables fol- 
lowing A/'(Orf, A^ ) and J\f{Od,A^ ), respectively. We measure the degree 
of anomaly between their jth variables x^ and x^ using a KL-divergence be- 
tween their conditional distributions pA{xf\x^.) and pb{x^\x?j), where ccA- 
and X?- are the remaining d — 1 variables. To compute the score, we first 
divide the precision matrix A^ and its inverse W^ into a (c? — 1) x (rf — 1) 
dimensional matrix, a d — 1 dimensional vector, and a scalar. 



A^ 



rA lA 



w' 



A' 






where we have rotated the rows and columns of A^ and W^ simultaneously 
so that their original jth rows and columns are located at the last rows and 
columns of the matrix. The matrices A^ and its inverse W^ are also divided 
in a same manner. The score is then given as 



d^ 



AB 



dx^^pp,{x^^)D^i,{j)^{xf\x^^)\\pB{x'^\x^^)) 






o ^ k 



if. K^Z 



B/B 



\i \j \J \J \j \i 



It VM 



A/A 



A 



B 



A 



+ i^u| + ai^(A,*-Af 



Here, the KL-divergence is averaged over the remaining d — 1 variables xt-. 
Since the KL-divergence is not symmetric and d^^ 7^ d^^ holds in general, 
the resulting anomaly score aj is decided as their maximum: 

aj = max{df^ , df^) . 
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6.2. Simulation Setting 



We evaluate the anomaly detection performance using sensor error data ( llde et al 



20091 ) ■ The dataset comprised 42 sensor values collected from a real car in 
79 normal states and 20 faulty states. The fault is caused by mis-wiring of 
the 24th and 25th sensors, resulting in correlation anomalies. Since sam- 
ple covariances are rank-deficient in some datasets, we added 10~^ on their 
diagonal to avoid singularities. 

For simulation, we randomly sample Uj^ datasets from the normal states 
and Uf datasets from the faulty states, and then estimate sparse precision 
matrices using six methods, CSSL with p = 1,2 and oo, SICS ([3]), and 
MSICS (jlj) with p = 2 and oo. For CSSL, we adopt the heuristic and set 
p = max(sia + sq, 0) and 7 = a for a given a, and for SICS and MSICS, we 
set p = a. We test each method for 11 different values of a ranging from 
10~^'^ to 10~°'^. The weight parameters tj in CSSL and MSICS are set as 
ti = 2^ for normal datasets and ti = ^ for faulty datasets to balance the 
effects from the two states. Since the anomaly score is designed only for a 
pair of datasets, we calculate anomaly scores for each of n^ x nf pairs of 
datasets. 

6.3. Result 

We repeated the above procedure 100 times for 4 different settings, [n^, Uf] 
= [4, 1], [12, 3], [20, 5] and [40, 10]. For each run, we evaluated the detection 
performance of each method by drawing an ROC curve and measuring the 
area under the curve (AUC). In Table HI we summarize the best median re- 
sults for each method and setting. The table shows that CSSL with p = 2,oo 
and MSICS with p = 00 achieve better detection performances than the oth- 
ers. In particular, CSSL with p = 2 and 00 achieve AUC = 1 as their median 
performance in some cases. This means that they detect faulty sensors per- 
fectly for more than half of the simulation. To see further differences, we plot 
the median anomaly scores derived from each method for [n^, nf] = [20, 5] in 
Figure [U From these graphs, we observe a clear distinction between success- 
ful methods and other methods on the significance of healthy sensors. The 
22nd and 28th sensors are relatively highly enhanced in SICS and MSICS 
with p = 2, but are not in CSSL and MSICS with p = 00. We conjecture 
that this is the major cause of performance differences. Interestingly, not 
only the 22nd and 28th sensors but most of the other healthy sensors also 
have the same tendencies. That is, CSSL and MSICS with p = 00 reasonably 
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Figure 1: Median anomaly scores for each method for [nn,ni] — [20,5] with best AUCs. 
Each plot is normalized so that the maximum is the same. Dotted lines denote true faulty 
sensors. 

suppress their significance while keeping erroneous sensors enhanced. More- 
over, although the differences are subtle, we can see that CSSL with p = 2 
and oo more successfully suppress the significance of sensors 1 to 21 and 33 
to 42 than does MSICS with p = oo. Thus, as we expected in the beginning, 
CSSL reduces the nuisance effects and highlights only those variables with 
correlation anomalies. The remaining peaks at some healthy variables are 
caused by the effect of the two faulty sensors since their effects may propagate 
to other healthy yet highly related sensors. 

7. Conclusion 

In this paper, we formulated the CSSL problem for multiple GGMs. We 
further provided a simple DAL-ADMM algorithm where each update step 
can be solved in a very efficient manner. Numerical results on synthetic 
datasets indicate the clear advantage of the CSSL approach, in that it can 
achieve high precision and recall at the same time, which existing GGM 
structure learning methods can not achieve. We also applied the proposed 
CSSL technique to the anomaly detection task in sensor error data. Through 
the simulation, we observed that CSSL could efficiently suppress nuisance 
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effects among variables in noisy sensors and successfully enhanced target 
faulty sensors. 

Several future research topics have been indicated, including analyzing 
the asymptotic property of the CSSL prob l em (Ip, and extending t he current 
formulation to the adaptive Las so (Zoul. 120061 : iFan et al.l . |2009[ ) type one 



to guarantee the oracle property (jZoul l2006l ) of the estimator. Applying the 



notion of commonness to more general dependency models, such as those with 
non-linear relations or commonness based on higher-order moment statistics, 
is also important. 
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Appendix A. Solutions to ( llSh for q = 1,2 and oo 

Here, we give detailed derivations of Table |5J 

Appendix A.l. The solution is in dCi. 

Problem ( !T^ for y G dCi is formulated as follows: 

niin-||t/-?/o||2 s.t. |l^y| =p. (A-1) 

Note that we have ignored the constraint \\y\\ 7^ 7 because it holds for 
general ^g and 7 with probability one. Hence, our interest is whether the 
solution to flA.ip satisfies \\y\\ < 7 or not. The additional constraint is not 
important in this respect. 

The problem (]A.1|) has two possible cases as its solution, ij^y = p and 
iJiy = —p. For each case, we can solve the problem using Lagrange multi- 
pliers: 

niinmax- \\y - y^Wl + p{lj^y - () , 
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where ( G {p, — p}. By setting the derivative over y to zero, we get y = 
y^ — pIat. Moreover, by substituting this result above, we derive the op- 
timal p as p = j^illiyQ — Q and the resulting objective function value is 

2^ (liv?/o "" C) • The constraint C, = p oi C, = —p is chosen so that this 
objective function value is minimized. Obviously, C, = p is optimal for the 
case when ijrj/g > 0, while C = — p for l^^g < 0- Thus, the overall solution 



to problem flA.ip is 



y = yo 



Ijv^o-psgii(ljv^o) ^ 



N ■ 



Appendix A. 2. The solution is in dC2 for q = I. 

When the solution is in dC2, the problem is formulated as 

min-||?/-?/o||2 s.t. \\y\\g = ^. 



(A.2) 



Here, the shape of the constraint boundary changes according to the value 
of q. Fo r general q £ [1 , cxp] , there exist several algorithms to solve this 
problem ( JBoyd and Vandenberghd . |2004J : ISral . 1201 ll ). Especially, for q = 1, 2 



and oo, solutions are available in a ve ry efficient manner. 

For g = 1, it has been shown by iHonorio and Samaras! ( 20101 ) that the 
problem is equivalent to the following Continuous Quadratic Knapsack Prob- 
lem: 



mm 

z 



^ 1 



N^ 



7: 



(A.3) 



which relates to y by yi = sgn (|/o,i) Zi. iHonorio and SamarasI ( 120101 ) have also 
provided a solution technique for this problem. From the KKT condition, the 
solution to ( IA.3P is Zi^u) = max(|yo,i| — ^, 0) for some constant u. Moreover, 

■ 7. Since iJfZ^u) is a decreasing piecewise 
\yo,i\, we can find a minimum breakpoint uq 



the optimal z/ satisfies iJfZ^u) = 

linear function with breakpoints 

that satisfies l^z(z/o) < 7 by sorting the A^ breakpoints. The optimal u is 

then given as 



z/ 



EiGXo 1^0.^1 " ^ 
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where Xq = {i; ||/o,i| — ^o > 0}. Note that the complexity of this algorithm is 
C(A^log A^) since we conduct a sorting of A^ values Q 

Appendix A. 3. The solution is in dC2 for q = 2. 

The solution to problem flA.2p for g = 2 is analytically available. We 
solve the problem using Lagrange multipliers: 

^ II l|2 I ■^ n\ l|2 2\ 

mjnmax-||7/-7/o||2 + -(||?/||2-7 )• 

By setting the derivative over y to zero, we get y = ytxI/o- Moreover, from 
the constraint \\y\\2 =7, the solution is 

7 

l|yoll2 

Appendix A. 4- The solution is in dC2 for q = oo. 

The solution of flA.2l) for the case g = oo is much simpler. The problem 
is just a box-constrained least squares, with solution 

7 (if yo,i > 7) 
yo,i (if \yo,i\ < 7) 
-7 (if yo,i < -7) 

which is equivalent to yi = sgn {yo,i) ^^{\yo,i\y'y)- 

Appendix A. 5. The solution is in dCs for q = 1. 

We provide the solution procedure for flTl]) for y G dC^ and q = 1 based 
on the next theorem. 

Theorem 3. Let y be the solution to problem [I4\ ) for y G dCi, and suppose 
it is infeasible in the original problem U^) . Then, the solution to ( fT^y for 
y G dC^ has same signs as y, that is, yiyi > for 1 < i < N. 
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We can further reduce this to expec ted linear time complexity by introducing a ran- 



domized algorithm (jDuchi et al.l . l2008bl ) . 
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From this result, we can factorize the variable indices into two parts, 
X^ = {i; fji > 0} and X_ = {i; jji < 0}. Using this factorization, the objective 
function is expressed as ^ Z]jex+(2/i-l/o,i)^ + | Z]jGX_(l/i-|/o,i)^- The equality 
constraints can also be expressed as J2iex Hi + ^«ex_ Vi ~ C; with ( G 
{p, — p} and ^jgj Hi — J2iei^ Vi — 7- Fi'oki these expressions, we derive two 
independent problems: 

min - J2 (yt - yo,iY s.t. ?/+ > , ^yt = ^— , 



min -^{Vi + yo,i) s.t. y >0, ^y- 

^ ieX- i&X- 



7-C 



The solutions to these problems relate to y in that j/j = yf for i G X+ and 
yj = —y~ for i G X_. These problems are continuous quadratic knapsack 
problems and the solution can be found by using the same algorithm as in 
problem ( lA.Sp . We derive the final solution by solving these problems for the 



two cases C = P and C = — P, and choosing the one with the smaller objective 
function value in ( jT^ . 

Appendix A. 6. The solution is in dC^ for q = 2. 

The solution in the case y G dCs and g = 2 is analytically available. We 
use Lagrange multipliers: 

II l|2 , /-iT /-\ I ^ n\ l|2 2\ 

mmmax - \\y - t/oHa + ^^I^nV - C) + TjlbL - 7 ) , 

y M.A z z 



where ( G {p, — p}. By setting the derivative over y to zero, we get y = 

Y^(yo ~ plAf)- If P = 0, we have p = ^ ° from the constraint ij^y = 0. 
Hence, from \\y\\2 = 7, we get the optimal y as 

\\yQ- filNh 

For the case p > 0, we have -^ = —i — from the constraint iTj-y = C- 

Hence, we have a quadratic equation in p from the constraint ||y||2 = 7^: 

2 II 1 ||2 2/1 T AT \2 



32 



Solving this equation gives the optimal y as 

where r = (I^t/q)^ — N 2j^_ 2 — ^—- By substituting this result into 

\\y-yo\\l, we have 

Since A^ II1/0II2 ~ (l^vl/o)^ — O5 ^^e niinimum of this value is achieved by 
choosing ( and a sign in /i as C = sgn (I^t/q) p and — sgn (I^^q). Thus, the 
overall result is 

y = sgn {llyo) J" {y^ - filp,) , 

Appendix A. 7. The solution is in dCs for q = 00. 

The solution for ( TT4l) with y G SCs and q = 00 has two possible cases, 
l^y = p and l^y = — p, where for each case the problem is: 

^ 1 
mill ^T^iVi- yo,if s.t. 1^?/ = C , -7!^ <y < I'i-N , (A.4) 

with C, G {p, — p}. Here, the constraint \y\^ = 7 is relaxed to \y\^ < 7. 
However, if the solution to (1A.4P satisfies \\y\\^ < p, it has already been 



found as a solution to ( 1T^ for y G (9Ci and therefore this relaxation does not 
affect the overall procedure. 

Since problem (lA.4p is a variant of the continuous quadratic knapsack 
problem, a similar strategy to (1A.3P is available. From the KKT condition, 
the solution to (IA.4P is of the form yi(z/) = sgn (?/o,j — ^) iiiin(|?/o,i — ^\,l) for 



some constant u. Moreover, the optimal v satisfies l}^y{iy) = (. Since lj^y{iy) 
is a decreasing piecewise linear function with breakpoints {1/0,1—1, yo,i+l} 



N 



we can find a minimum breakpoint uq that satisfies l^t/(z/o) ^ C by sorting 
the 2N breakpoints. The optimal z/ is then 



EiGX2 ^0.^ + ^(1^1 1 - |2^3|)-C 



(if 2:2^ 



^=< IX2I 

i/Q (if X2 
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where Xi = {i; yo,i 

{i-.yoA-^o < -7}- 



t^o > 1}, ^2 = {i; -7 < yo,i - t'o < 7} and X3 



Appendix B. Generation of Synthetic Precision Matrices 

Here, we present the detailed procedure used to generate the sparse pre- 
cision matrices with a common substructure in Section O The procedure is 
composed of two sequential steps. We first generate a single precision matrix, 
which is the common substructure in the resulting A^ matrices. Then, we add 
some non-zero entries to get a matrix Aj. This additional pattern is chosen 
to be unique for each matrix so that the resultant matrices Ai, A2, . . . , Aat 
satisfy the additive model assumption ([7]). In the following two subsections, 
we explain the above steps. 



Appendix B.l. Generation of a Sparse Precision Matrix 

In several previous studies, synthetic sparse precision matrices are gener- 
ated in a naive manner, that is, just adding a properly scaled identity matrix 
to a sparse s ymmetric matrix so that the resulting matrix is sparse and p osi- 
tive definite dBaneriee et all . I2OO8I : IWang et all . I2OO9I : Ihi and Tohl . I2OI0I 1. In 
our simulations, we take a different approach to generating a sparse precision 
matrix for compatibility with the next step. 

Our approach is based on an eigen-decomposition A = VDV^ , where 
D is a matrix with eigenvalues on its diagonal and V is an orthonormal 
matrix such that V^V = VV~^ = Id- Here, we use the fact that A is sparse 
if V is sufficiently sparse and the problem can be reduced to generating a 
sparse or thonormal matrix V. This c an be done easily by applying a Givens 
rotation (JGolub and Van Loanl . Il996l ) to an identity matrix 7^. Formally, we 
let V^^^ = Id and apply the following procedure repeatedly until the desired 
sparsity is achieved. 

1. Randomly pick two indices j,j' from {1, 2, ... , d}. 

2. Randomly generate 6 from a uniform distribution U{[0, 27r]). 

3. Update the {j,j), {j,j'), ij',j) and (j', j')th entries of V^^'^ as 
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^33' 



3'3 



(fc+1) T.(fc+1) 



yy^-r 
^3'3' 



^ 



COS 6 — sin 9 
sin 6 cos 6 
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3' 3 



ylk) 
^3'3' 



r{k+l) 



r{k) 



4. Keep the remaining entries V^^^^T"' ^ V>J, for (jo, Jo) ^ { iJJ), iJJ'), 
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In our simulations, we generated each eigenvalue from a uniform distribution 

f/([0,l]). 

Appendix B.2. Generation of Sparse Precision Matrices with a Common 
Substructure 

Here, we turn to imposing commonness on the resulting precision matri- 
ces. To begin with, we generate small sparse precision matrices \l'i, \1'2, • • • , ^E'a 
in the preceding manner and construct a sparse block-diagonal precision ma- 
trix Aq = block — diag (\l'i, \I'2, • • • , ^E'a)- We then add some non-zero entries 
to Aq and generate A^ precision matrices Ai, A2, . . . , A^v- At this stage, we 
keep the original non-zero entries Aq unchanged so they form a common sub- 
structure at the end. Note that the addition of non-zero entries can not be 
done randomly since this might destroy the positive definiteness. 

We describe the procedure for the case a = 2. Let the eigen-decompositions 
of ^1 and ^2 be ^1 = ViDiV^ and ^2 = 14^2^^- Note that Vi and V2 
are sparse since they are generated to be so. Now, let matrix Aj be of 

^1 $^ 
$7 ^2 

matrix $j while keeping the positive definiteness of Aj. This corresponds 
to keeping a determinant of Aj positive. Here, we choose $j of the form 
<l>j = Vl''E:iV2~^ where Hj is a 6 x 6 diagonal matrix and Vi and V2 are ma- 
trices composed of b columns in Vi and V2, respectively. Specifically, we 
let Vi = [ vi,i vi,2 • • • f i,di ] and V2 = [ '^2,1 '^'2,2 • • • ^^2,^2 ] , where 
di and ^2 denote the dimensionality of each matrix. Then V^ and V2 are 

Vl = [ ^l.vri.i «l,7ri,2 • • • t^l,7ri,6 ] aud V^ = [ t;2,7r2,i ^'2,7r2.2 • • • '«^2,7r2.f, ] , 

respectively, for some index sets {tti^i, 111^2, • • • , '^i,b} ^ {1, 2, . . . , rfi}, { 112,1, 
T^2,2i ■ ■ ■ 1 7r2,fe} ^ {1? 2, ... , ^2}- Then, from a general matrix property, we can 
express the determinant of Aj as 

det Aj = det (^1 - $j^^^$7) 

= det (Z^i - V^^,V2D2^y2^Jyx) 



the form Aj 



The objective is to generate a sparse non-zero 



n 






1 V ^2 



where D^ = diag(o-i,i, 0-1,2, • • • , o-i,di), D2 = diag((T2,i, 0-2,2, • • • , 0-2,^2) and Sj = 
diag(^j,i,^j,2, . . . ,^i,b)- Hence, the positive definiteness of Aj is guaranteed if 
^im < '^i,7ri„o"2,7r2m ^^ Satisfied for 1 < m < 6. Moreover, this inequality 
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provides us a guideline on choosing index sets. Since we want non-zero entries 
of <l>j to be larger, which can be achieved by larger \^i^m\, "we choose index sets 
so that CTi 7ri^o"2,7r2,„ large. This corresponds to choosing leading eigenvalues 
and eigenvectors of \l/i and \1'2. In our simulations, we pick 6 = 2 indices 
at random from those with eigenvalues in the top 1/3. We also generate 
Ci,m as ^i,„ = ^0Amv/c^i,7ri,™O'2,7r2™, where ^o,i,m follows a uniform distribution 
f/([-0.8,-0.5]U[0.5,0.8]). 

For general a > 2 cases, we first construct a matrix A,- from \l/i and \l/2. 
We then iteratively apply the above procedure to generate A^ from A^- and 
\l/3, Al from A,- and ^4, until Aj = A^ is derived. In the simulations 
in Section |5l we set the number of modules to a = 2 for d = 25, a = 3 for 
d = 50 and a = 4 ioT d = 100. 



Appendix C. Proof of Theorems 

Appendix C.l. Proof of Proposition [I] 

Let E and Fi be non- negative dxd matrices satisfying —Ejji < Qjji < Ejji 
and —Fijji < ^ijj' < -^ij/, respectively, for all 1 < z < A^ and 1 < j,j' < d. 
Then, using Lagrange multipliers F, Fq, and {Aj, Ao,j}^i, the CSSL problem 
dH]) is expressed as 



N 



max 

e,E,{n„F, 



min V ti {logdet(e + Vt,) - tr [S,{& + Vt,)]} 

}f^,r,ro,{A„Ao,jf=,tr 

- tr [F0] + tr [abs(F)E] + tr [Fq^;] 

N 

- Y, {tr [A,a] - tr [abs(A,)F,] - tr [Ao,.Fi]} 
s.t. Fojy > , Ao,.ij,v > {l<i<N,l<J,f <d). 

By changing the order of maximization and minimization above, we derive 
the dual problem. Now, we optimize each variable G, E, Qi and Fi by setting 
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each derivative to zero: 



N 



^t,{(e + a)~'-5,}-r = o,xd 



j=i 



- pldlj + abs(r) + To = Odxd , 

t, {(6 + a)-' - 5,} - A, = Orfx, (1 < 2 < iV) , 

- ^ ( E ^.y 1 ^^'n' + \^i,n' I + ^o,,n' = 

(l<^<iV,l<j,/<rf). 
As a result of these equations, we get 

Ai = u {{Q + n^y^ - Si} , 

<P (i<j,j'<rf), 



TV 



N 



J2\^^,n'\'] <7 (l<J,/<rf) 



.j=i 



and so the dual problem is given by (|9]) where we set Wj = (9 + f2j) ^ = 

Appendix C.2. Proof of TheoremUl 

We first prove the lower-bound. Let Wi = -f A, + Si in the dual prob- 



lem (|9]). Then we have 
hence 



^^^ 



Ei=i^w < P and Ei=il^ 



^Af 



^« Jj' I 



< 7, and 



-Ai + S, 



S ''* 

^ d . ^ I 11 ^ 

< — max Aj ,•,•/ + Lb, 



U J,3 

d 



i,3f 



■'Mis 



< — maxmax lAj ,-.;/| + IISjIL 

J. . ,- ^-Z ' '^J I II HO 



t 



« JJ 



< ^+11^-11 

^ ^ + ll^«lls ' 
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where the last inequahty comes from the general relationship between ip- 

norms max^ \^^,jf\ < (E^i I^^jj'I^) '• Since W* = j-A* + Si = A*"^ holds 
at the optimum, we have the lower-bound. 

We now turn to proving the upper-bound. From strong duality, the 
duality-gap is zero at the optimal solution to the primal and the dual prob- 
lems ([8]), ([9]), and we have 

N 



p ||e*||, + 7 llf^lli,, = d - J^t.tr [5,(6* + n*)] 



i=l 



Moreover, from < p < Np'y < oo, tr[S'j(0* + fi*)] > and the general 
£p-norm rule feii l^ljj'H " ^ ^^^i l^ljj'l^ 






holds. Since Np > 1 for p > 1, we get 

Nvd 



19*11 + \\Q*\\ < 

\^ 111 ^ II" lll,oo — 



P 



We use this inequality to derive the upper-bound. From the definition, the 
precision matrix factorizes as A* = Q* + Q*, and hence we have 

||A*|| < 119*11 _L ||n*|| 

IM»-, lis ^11'-' lis ^ IP'i lis 

< ||e*|L + <imax|a*..,| 
- II lib .., i,jj 

< ||0*|L + (imaxmax |r2*,;_;| 

< 119*11 _L f] \\n*\\ 

^ ||w iig -I- a ||ii 11-,^^^ 

<rf(l|01ls + ll^1ll,oo) 

<d{\\e% + \\n%^J) 

N-v<f 
P 
Here, we have used the relationship ||0*||g < ||0*||2 ^ l|0*lli- ^ 
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Appendix C. 3. Proof of Theorem [H 

The Hessian matrix of the CSSL primal loss ^1,1=1 td{^ + ^u Si) is given 
by 



/t-nri 



primal 



\El,UK. 


tiKi 


t2K2 


^nKm 


tiKi 


tiKi 


0^2x^2 


^cPxcP 


t2K2 


0^2x^2 


t2K2 


^cPxcP 


^nKm 


0^2x^2 




^d?xd? ^nKm 



where i^j = (6 + Vti) ^ ®{Q + Vti) ^. It is easy to verify that 1n+i ® Id spans 
a null space of "H primal and thus ?^ primal is always rank-deficient. 

On the other hand, the matrix of the CSSL dual loss — Ylii=i ^i logdet Wi 
is the block-diagonal matrix 



"Hduai = block-diag(tiiri, ^2^2, • • • , tNK. 



N) 



where Ki = Wf^ ® W^^^. From Theorem [H we know that the CSSL solution 
has bounded eigenvalues and thus the above Hessian matrix is always strictly 
positive definite for any feasible Wi. D 



Appendix C.4- Proof of the Proposition\^ 

Let Si be the covariance matrix Si - 
bound for ( TT^ of 

N 



ai Ti 
Ti hi 



. Then we have an upper- 



y^tj. [\og{uiVi - {6 + UiY) - {ttiUi + biVi + 2rid + 2riUJi)] 



i=l 



-2p\e\ -27||a;| 



N 



< 



y^tj {\og{uiVi - (6* + ujif) - {ttiUi + hiVi) - 2{riUi + 71^*1)} 



i=l 



N 



2[Y^Ur,e + p\e\ 



i=l 



^Af 



from the relationship '^i^iti\^i\ < ll'^lloo — ll'^^IL- Moreover, this upper- 
bound coincides with the original problem when oj = Otv- Therefore, if 
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a; = Oat is a inaximizer of this upper-bound, it is also a maximizer of (llSp . 
From the derivative of the upper-bound over Wj, we get that Wj = is a 
maximizer if the following condition holds: 

-(7 + ri) < ^ < (7 - ri) . 

This is a sufficient condition for the original problem (TT5|) to have cjj = as 
its solution. Under this condition, problem (fT5|) can be expressed as 

max log('U'y — 6*^) — {au + hv) — 2{r9 + p\9\) 

8,u,v,Ui,Vi 

s.t. uv - 9'^ > , 

-i7 + n)< ^<(7-r.) il<t<N) 

UiVi - 9^ 

for some properly chosen d,b and f = Yli=iti^i- Hence, since the additional 
condition involves ^ = irrelevant to the value of Ui and V j if maxi<j<7v \ri\ < 



7 holds, we have 9 = when |r| < p from llde et al.l (120091 . Proposition 1). D 



Appendix C. 5. Proof of Theorem 

Let h{y) = kWy — yoWl and y' be one of the feasible solutions to the origi- 



nal problem ( ITSj) . Moreover, since y is infeasible for the original problem (TT3|) . 
\\y\\^ > 7 holds. Then, for y" = y' + e{y - y') with < e < 1, h{y") < h{y') 
holds from the convexity of h. Therefore, y" is a better solution to prob- 
lem (IT3l) as long as the constraints |l7v2/"| < P and \\y"\\g < 7 are satisfied. 

The first condition always holds because ll^v?^"! ^ (l~^)|l)vy'l+^|l7vi/l — P- 

1 

On the other hand, the latter condition ||y"|| = (X]j=i ll/i'M' < 7 is no 
longer valid if \\y'\\ = 7 and sgn (y^) = sgn {iji — y[), which results in yiy[ > 0. 
This is a necessary condition for the solution to (!T3l) . Otherwise, we can al- 
ways improve the solution by the above procedure, which contradicts its 
optimality. D 
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Table 3: Simulation results for three cases {d — 25, 50 and 100) with N ^ 5 datasets eval- 
uated by weighted precision, recall and F- measure, denoted by "Prec", "Rec." and "F" in 
the table, respectively. The "Fq" denotes the Fo-measure for zero pattern identification. 
Each simulation is conducted so that each dataset has 5d data points, and the measure- 
ments are averaged over 100 random realization of datasets. The numbers in brackets are 
standard deviations of each measurement. Each of the three rows in SICS and MSICS 
corresponds to results for eo = 0.5, 0.7 and 0.9 from the top. We highlight the top three 
results for each measurement in bold font (except for "Fq"). 







CSSL 


CSSL 


CSSL 


CSSL 


CT 


rta 


MSICS 


MSICS 






{p-- 


= 1) 


(P = 


= 2) 


{p = 


oo) 


(7 = 


: oo) 


blL-O 


(P = 


= 2) 


{P = 


oo) 




o 


















.14 


;-i4) 


.38 


.21) 


.54 


'.23) 






.84 ( 


;.i9) 


.70 ( 


;.i6) 


.56 ( 


;.i9) 


.48 


;.2o) 


.20 


M6) 


.43 


.21) 


.49 


;.2i) 




















.33 


'.16) 


.41 


.19) 


.45 


M9) 




















.07 


.07) 


.48 


.24) 


.60 


.24) 




0) 


.45 


.32) 


.82 


;-i4) 


.84 ( 


;.i2) 


.86 ( 


;-ii) 


.23 


M8) 


.74 


.19) 


.74 


M9) 


TS 




















.80 


;.2o) 


.83 


.13) 


.86 ( 


;.ii) 




















.09 


'.08) 


.41 


.21) 


.55 


'.23) 




fe 


.56 


.22) 


.75 ( 


;-i4) 


.66 ( 


:-i7) 


.60 ( 


;.i9) 


.21 

.45 


M6) 

M8) 


.53 
.53 


.21) 
.19) 


.58 
.58 


'.20) 

M8) 


^ 


.92 


.02) 


.92 


;.02) 


.92 


.02) 


.92 


;.02) 


.92 


;.02) 


.93 


.02) 


.92 


;.02) 




O 


















.10 


'.13) 


.24 


.20) 


.58 ( 


;.i9) 




0) 


.87 ( 


:-ii) 


.69 ( 


;.i4) 


.56 


.17) 


.47 


;-i7) 


.13 


;-i4) 


.37 


.20) 


.52 


M9) 


o 


















.27 


M9) 


.42 


.18) 


.47 


'.18) 




















.04 


'.04) 


.18 


.19) 


.60 


M9) 






.41 


.20) 


.83 


Ml) 


.85 ( 


;.io) 


.91 ( 


;.05) 


.10 


;-ii) 


.51 


.21) 


.72 


M6) 


TS 




















.50 


;.22) 


.81 


.12) 


.86 ( 


;.08) 




















.05 


'.06) 


.20 


.19) 


.58 


M9) 




fc 


.53 


.20) 


.75 ( 


;.i2) 


.66 ( 


;.i5) 


.61 ( 


;.i5) 


.10 
.34 


'.20) 


.42 

.54 


.20) 

.17) 


.59 
.60 


M8) 
M6) 


o 


.90 


.03) 


.90 


'.02) 


.89 


.02) 


.89 


'.03) 


.89 


'.03) 


.90 


.02) 


.90 


'.03) 




o 


















.09 


;-ii) 


.17 


.14) 


.68 ( 


;.i5) 




0) 


.91 ( 


;.o7) 


.78 ( 


;.io) 


.64 


.14) 


.53 


'.15) 


.10 


M2) 


.33 


.21) 


.62 


M6) 


o 
o 


















.22 


M7) 


.46 


.18) 


.55 


M6) 




















.03 


MO) 


.06 


.10) 


.59 


;-i7) 


II 


0) 


.37 


.18) 


.81 


Ml) 


.83 ( 


:-ii) 


.95 ( 


;.02) 


.06 


MO) 


.25 


.21) 


.67 


'.15) 


TS 




















.24 


M9) 


.67 


.16) 


.82 ( 


;.09) 




















.05 


MO) 


.08 


.11) 


.63 


M6) 




fc 


.51 


.19) 


.79 ( 


;.io) 


.72 ( 


;.i2) 


.67 ( 


;.i2) 


.07 
.22 


MO) 

'.18) 


.28 
.54 


.21) 

.17) 


.64 

.65 


M5) 

;-i4) 




.87 


.04) 


.87 


;.o4) 


.87 


.03) 


.87 


'.03) 


.87 


;.03) 


.88 


.04) 


.87 


'.03) 



46 



Table 4: Anomaly detection results: The simulation is conducted for 4 different settings, 
[rin, n{] = [4, 1], [12, 3], [20, 5] and [40, 10]. For each method, we compute precision matrices 
for 11 different values of a ranging from 10^^'^ to 10^°'^. The table shows the median 
of the best AUCs among these 11 results over 100 random realizations of datasets. The 
numbers in brackets are 25% and 75% quantiles. The bold font represents the top three 
resuhs, which are CSSL {p = 2), CSSL (p = oo) and MSICS {p == oo) for all settings. 





[niiiTif] = [4, 1] 


[nn,nf = [12,3 






best AUC 


a 


best AUC 


a 


CSSL {p = 1) 
CSSL {p = 2) 
CSSL {p = oo) 

SICS 
MSICS {p = 2) 
MSICS {p = oo) 


.975 (.950 / .987) 
.987 (.963 / 1.00) 
.987 (.963 / 1.00) 

.975 (.938 / .987) 

.975 (.950 / .987) 

.987 (.963 / 1.00) 


10-0.9 

10-0.9 
10-0.9 
10-0.5 
10-0.8 

10-11 


.975 (.950 / 1.00) 
.987 (.963 / 1.00) 
1.00 (.987 / 1.00) 

.975 (.938 / .987) 

.975 (.950 / .987) 

.987 (.975 / 1.00) 


10-0.9 

10-0.9 
10-0.9 
10-0.5 
10-0.7 

10-1-2 




n-a, nt] = 20, 5 




nn,nf] = 40, 10] 




best AUC 


a 


best AUC 


a 


CSSL {p = 1) 
CSSL Ip = 2) 
CSSL {p = oo) 

SICS 
MSICS {p = 2) 
MSICS (p = oo) 


.975 (.950 / 1.00) 
1.00 (.975 / 1.00) 
1.00 (.987 / 1.00) 

.975 (.950 / .987) 

.975 (.950 / .987) 

.987 (.975 / 1.00) 


10-0.9 
10-0.8 
10-0.9 
10-0.5 

10-1.0 
10-11 


.975 (.963 / 1.00) 
.987 (.963 / 1.00) 
1.00 (.987 / 1.00) 

.975 (.950 / .987) 

.975 (.950 / .987) 

.987 (.975 / 1.00) 


10-0.9 
10-0.8 
10-0.9 
10-0.5 

10-1.0 

10-0.9 



47 



